In this document we’re interested in generating predictions for every pairing, using Elo scores alone. We don’t expect this to do terribly well, but it’s a baseline.

Geddit?

Let’s read in the Elo ratings we downloaded from the Web.

elo_m <- read.csv('data/elo_ratings/atp22.csv')
head(elo_m)
| Rank | Player | Age | Elo | HardRaw | ClayRaw | GrassRaw | hElo | cElo | gElo | Peak.Match | Peak.Age | Peak.Elo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Carlos Alcaraz | 19.0 | 2200.2 | 2028.5 | 2038.1 | 1441.4 | 2114.4 | 2119.2 | 1820.8 | 2022 Roland Garros R16 | 19.0 | 2221.9 |
| 2 | Novak Djokovic | 35.0 | 2168.6 | 2054.9 | 2022.3 | 1942.4 | 2111.7 | 2095.4 | 2055.5 | 2016 Miami F | 28.8 | 2470.0 |
| 3 | Alexander Zverev | 25.1 | 2131.3 | 2012.7 | 2043.9 | 1671.4 | 2072.0 | 2087.6 | 1901.4 | 2022 Atp Cup RR | 24.7 | 2158.4 |
| 4 | Rafael Nadal | 36.0 | 2095.8 | 1934.9 | 1967.6 | 1500.0 | 2015.3 | 2031.7 | 1797.9 | 2009 Madrid SF | 22.9 | 2370.0 |
| 5 | Jannik Sinner | 20.8 | 2070.0 | 1969.9 | 1896.4 | 1312.8 | 2020.0 | 1983.2 | 1691.4 | 2022 Madrid R32 | 20.7 | 2079.0 |
| 6 | Stefanos Tsitsipas | 23.8 | 2058.2 | 1899.1 | 2040.2 | 1572.4 | 1978.6 | 2049.2 | 1815.3 | 2021 Roland Garros SF | 22.8 | 2133.0 |

According to the website, the gElo column is a 50/50 average of the all-surfaces Elo score (Elo) and the grass-only Elo score (GrassRaw). That’s their recommendation, though we may or may not follow it.
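As a quick sanity check of that claim, the sketch below recomputes the blend for the six rows shown by head(elo_m). The values are typed in by hand from that output, and the names `top6`, `blend` and `gap` are made up for this illustration. Since the published figures are rounded to one decimal place, a small discrepancy is to be expected.

```r
# Values copied by hand from the head(elo_m) output above.
top6 <- data.frame(
  Player   = c("Carlos Alcaraz", "Novak Djokovic", "Alexander Zverev",
               "Rafael Nadal", "Jannik Sinner", "Stefanos Tsitsipas"),
  Elo      = c(2200.2, 2168.6, 2131.3, 2095.8, 2070.0, 2058.2),
  GrassRaw = c(1441.4, 1942.4, 1671.4, 1500.0, 1312.8, 1572.4),
  gElo     = c(1820.8, 2055.5, 1901.4, 1797.9, 1691.4, 1815.3)
)

# Equal-weight blend of the overall and the grass-only rating.
top6$blend <- (top6$Elo + top6$GrassRaw) / 2

# Absolute difference from the published gElo (inputs are rounded to 0.1).
top6$gap <- abs(top6$blend - top6$gElo)

top6[, c("Player", "gElo", "blend", "gap")]
```

If the weights turn out not to suit us, the same two columns can be recombined with any other weighting.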

We can convert Elo scores to Bradley–Terry abilities. Let \(a_i\) represent the Elo score of player \(i\). Then the probability that player \(i\) defeats player \(j\) (ignoring the possibility of a draw) is given by

\[ p_{ij} = 1 - \frac1{1 + 10^{(a_i - a_j)/400}} = \frac{10^{a_i/400}}{10^{a_i/400} + 10^{a_j/400}}, \]

so that

\[ \frac{p_{ij}}{p_{ji}} = \frac{10^{a_i/400}}{10^{a_j/400}}, \]

hence

\[ \ln \frac{p_{ij}}{p_{ji}} = a_i \frac{\ln 10}{400} - a_j \frac{\ln 10}{400}. \]

Thus one can convert an Elo rating into a Bradley–Terry (log-)score by multiplying it by \(\frac1{400}\ln 10\).

elo_prob <- function(a1, a2) {
  1 / (1 + 10^((a2 - a1) / 400))
}

So the probability that Carlos Alcaraz beats Novak Djokovic is

elo_prob(elo_m[1, 'Elo'], elo_m[2, 'Elo'])
## [1] 0.5453511
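To see the Elo-to-Bradley–Terry conversion in action, here is a minimal sketch that rescales the two ratings by \(\frac1{400}\ln 10\) and then applies the logistic function to the difference. The helpers `elo_to_bt` and `bt_prob` are names invented for this example, and the two ratings are copied from the table above. Both routes should give the same probability.

```r
# Elo win probability, as defined earlier in this document.
elo_prob <- function(a1, a2) {
  1 / (1 + 10^((a2 - a1) / 400))
}

# Hypothetical helper: rescale an Elo rating to a Bradley-Terry log-ability.
elo_to_bt <- function(a) a * log(10) / 400

# Bradley-Terry win probability from log-abilities:
# exp(l1) / (exp(l1) + exp(l2)) is the logistic function of l1 - l2.
bt_prob <- function(l1, l2) plogis(l1 - l2)

a_alcaraz  <- 2200.2  # Elo column, row 1
a_djokovic <- 2168.6  # Elo column, row 2

p_elo <- elo_prob(a_alcaraz, a_djokovic)
p_bt  <- bt_prob(elo_to_bt(a_alcaraz), elo_to_bt(a_djokovic))

all.equal(p_elo, p_bt)
```

Working on the log-ability scale is convenient if the ratings are later fed into a logistic regression or another Bradley–Terry-style model.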

Now to make some predictions. Let’s pull in the template with all the pairings. Watch out for non-ASCII letters! (Make sure to set the encoding to UTF-8.)

template <- read.csv('submission-template.csv', encoding = 'UTF-8')
head(template, 10)
| player1_name | player2_name | player1_id | player2_id | Gender | p_player1_win | p_player2_win |
|---|---|---|---|---|---|---|
| iga swiatek | barbora krejcikova | 1 | 2 | W | NA | NA |
| iga swiatek | paula badosa | 1 | 3 | W | NA | NA |
| iga swiatek | maria sakkari | 1 | 4 | W | NA | NA |
| iga swiatek | anett kontaveit | 1 | 5 | W | NA | NA |
| iga swiatek | karolina pliskova | 1 | 6 | W | NA | NA |
| iga swiatek | ons jabeur | 1 | 7 | W | NA | NA |
| iga swiatek | aryna sabalenka | 1 | 8 | W | NA | NA |
| iga swiatek | danielle collins | 1 | 9 | W | NA | NA |
| iga swiatek | garbiñe muguruza | 1 | 10 | W | NA | NA |
| iga swiatek | jessica pegula | 1 | 11 | W | NA | NA |

Are all the male players from the template in our Elo table?

library(dplyr)
men <- with(subset(template, Gender == 'M'),
            union(player1_name, player2_name))

Watch out for encodings or invisible Unicode characters (like non-breaking spaces) in data that’s scraped from the web!

I have fixed this in webscraping.Rmd so the following should now work.

'novak djokovic' %in% men
## [1] TRUE
tolower(elo_m$Player[2])
## [1] "novak djokovic"
'novak djokovic' == tolower(elo_m$Player[2])
## [1] TRUE

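Could non-breaking spaces still be lurking? Escaping the strings makes any non-ASCII character visible as an escape sequence; here both names come back clean.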

stringi::stri_escape_unicode('novak djokovic')
## [1] "novak djokovic"
stringi::stri_escape_unicode(elo_m$Player[2])
## [1] "Novak Djokovic"
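For comparison, this is roughly what the problem looks like when a non-breaking space is present. The sketch uses base R only and a made-up string, `scraped`, standing in for a name as it might arrive from a web page; it is not taken from the actual data.

```r
# Hypothetical scraped name with a non-breaking space (U+00A0) between words.
scraped <- "Novak\u00a0Djokovic"
clean   <- "Novak Djokovic"

# The two strings print alike but do not compare equal.
scraped == clean

# Inspect the code point of the sixth character: 160 is U+00A0, 32 a plain space.
utf8ToInt(scraped)[6]

# Detect the offending character, then replace it with an ordinary space.
has_nbsp <- grepl("\u00a0", scraped, fixed = TRUE)
repaired <- gsub("\u00a0", " ", scraped, fixed = TRUE)

repaired == clean
```

Applying the same `gsub()` call to a whole column right after scraping is one way to keep the problem out of every later comparison.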

Now, who in the submission template is missing from the scraped Elo ratings?

men[!men %in% tolower(elo_m$Player)]
## [1] "felix auger-aliassime" "albert ramos-vinolas"  "roger federer"        
## [4] "soonwoo kwon"          "jan-lennard struff"

Are they really missing?

library(stringr)
str_subset(elo_m$Player, 'elix') # hyphens
## [1] "Felix Auger Aliassime"
str_subset(elo_m$Player, 'amos') # double-barrelled name
## [1] "Albert Ramos"
str_subset(elo_m$Player, 'oger')
## character(0)
str_subset(elo_m$Player, 'won') # spacing
## [1] "Soon Woo Kwon"
str_subset(elo_m$Player, 'nnard') # hyphens
## [1] "Jan Lennard Struff"
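So most of these players are present and merely spelled differently; only Roger Federer has no entry at all. One possible way to reconcile the spellings, sketched below under the assumption that ignoring hyphens and spaces does not make two different players collide, is to compare names on a stripped-down key. The helper `name_key` and the two small vectors are invented for this illustration, with the names copied from the output above.

```r
# Hypothetical helper: lower-case the name, then keep only the letters a-z,
# so that hyphens and spacing differences disappear.
name_key <- function(x) gsub("[^a-z]", "", tolower(x))

# Names as they appear in the submission template (from the output above).
template_names <- c("felix auger-aliassime", "albert ramos-vinolas",
                    "roger federer", "soonwoo kwon", "jan-lennard struff")

# Spellings found in the Elo table (from the output above).
elo_names <- c("Felix Auger Aliassime", "Albert Ramos",
               "Soon Woo Kwon", "Jan Lennard Struff")

# Position of each template name in the Elo table, NA when there is no match.
idx <- match(name_key(template_names), name_key(elo_names))

data.frame(template = template_names, elo = elo_names[idx])

# Names that still need manual attention.
still_missing <- template_names[is.na(idx)]
still_missing
```

The key fixes the hyphenation and spacing cases, but not the shortened surname (Albert Ramos) or the genuinely absent player. Before using it on the full table, it would be worth checking `anyDuplicated(name_key(elo_m$Player))` to confirm that no two players share a key.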

And now the same for women:

elo_w <- read.csv('data/elo_ratings/wta22.csv')
women <- with(subset(template, Gender == 'W'),
              union(player1_name, player2_name))
women[!women %in% tolower(elo_w$Player)]
## [1] "garbiñe muguruza"    "coco gauff"          "alizé cornet"       
## [4] "elena-gabriela ruse" "irina-camelia begu"  "xinyu wang"

Diagnose the issues:

str_subset(elo_w$Player, 'arbi') # diacritics
## [1] "Garbine Muguruza"
str_subset(elo_w$Player, 'auf')  # different forename?
## [1] "Cori Gauff"
str_subset(elo_w$Player, 'orne') # diacritics
## [1] "Alize Cornet"
str_subset(elo_w$Player, 'a ?[Gg]abr') # hyphenation
## [1] "Elena Gabriela Ruse"
str_subset(elo_w$Player, 'elia') # hyphenation
## [1] "Irina Camelia Begu"
str_subset(elo_w$Player, 'in ?[Yy]u') # spacing
## [1] "Xin Yu Wang"
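All six women are therefore in the Elo table under a different spelling. The sketch below shows one way these could be matched automatically: transliterate diacritics with `stringi::stri_trans_general()`, compare on a letters-only key, and keep a small manual alias table for names that differ in substance. The helper `name_key` and the `aliases` vector are hypothetical, and the names are copied from the output above.

```r
library(stringi)

# Hypothetical helper: strip diacritics, lower-case, keep only the letters a-z.
name_key <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")  # e.g. n-tilde -> n, e-acute -> e
  gsub("[^a-z]", "", tolower(x))
}

# Hypothetical alias table: template spelling -> spelling used in the Elo table.
aliases <- c("coco gauff" = "cori gauff")

# Template names from the output above (\u00f1 is n-tilde, \u00e9 is e-acute).
template_names <- c("garbi\u00f1e muguruza", "coco gauff", "aliz\u00e9 cornet",
                    "elena-gabriela ruse", "irina-camelia begu", "xinyu wang")

# Spellings found in the Elo table (from the output above).
elo_names <- c("Garbine Muguruza", "Cori Gauff", "Alize Cornet",
               "Elena Gabriela Ruse", "Irina Camelia Begu", "Xin Yu Wang")

# Apply aliases first, then match on the normalised key.
lookup <- ifelse(template_names %in% names(aliases),
                 aliases[template_names], template_names)
idx <- match(name_key(lookup), name_key(elo_names))

data.frame(template = template_names, elo = elo_names[idx])
```

Keeping the aliases in a separate, explicit table makes it easy to review the handful of manual decisions, while the key handles diacritics, hyphens and spacing mechanically.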